



NEW SERIES, NO. 134 (VOL. XVII) JUNE, 1921 

QUARTERLY PUBLICATION OF THE 

AMERICAN 
STATISTICAL ASSOCIATION 

THE COEFFICIENT OF CORRELATION 
By Franz Boas, Columbia University. 



Students who use the coefficient of correlation are aware that its
value does not always express the biological, psychological, social, or
economic relations which are the subjects of our studies; they realize
that its value depends upon a good many extraneous conditions. Since,
however, in the literature the coefficient is too often used in a
mechanical way, and the results of study are generally interpreted as
having an immediate bearing upon the problems at issue, without a
detailed study of the composition of the series, it seems desirable to
call attention to some of the conditions influencing its value.

When two variables are interdependent so that a variant of one
partly determines the correlated array of the other, we express their
relation by the coefficient of correlation. The derivation of this value
is based on the assumption that the averages of arrays are proportional
to the determining variant, each being measured from its own average,
and on the further assumption that all arrays have equal variability.
The former assumption is expressed by

(1a) $[y] = q_1 x$ (brackets indicating averages)

and conversely by

(1b) $[x] = q_2 y$

From these we derive by multiplication

(2a) $[xy] = q_1 [x^2] = q_1 \sigma_x^2$

and

(2b) $[xy] = q_2 [y^2] = q_2 \sigma_y^2$.

It follows that

$q_1 \dfrac{\sigma_x}{\sigma_y} = q_2 \dfrac{\sigma_y}{\sigma_x}$

and we call this value r, the coefficient of correlation. Hence

(3) $r = \dfrac{[xy]}{\sigma_x \sigma_y}$

If we call the variability of an array of y around its own average $s_y$,
then the mean square variation of the array around the average of the
total series to which the array belongs will be

$s_y^2 + q_1^2 x^2$

The total mean square variability of y, which is $\sigma_y^2$, must therefore
equal

(4a) $\sigma_y^2 = s_y^2 + q_1^2 \sigma_x^2$

(4b) $\sigma_y^2 = s_y^2 + r^2 \sigma_y^2$, or

(4c) $s_y^2 = \sigma_y^2 (1 - r^2)$

In the same way

(4d) $s_x^2 = \sigma_x^2 (1 - r^2)$

In the series of these values, we are operating therefore with the
following seven quantities: $\sigma_x$, $\sigma_y$, $s_x$, $s_y$, $q_1$, $q_2$, $r$.

There are certain relations between these quantities.

(5)

$q_1 q_2 = r^2$

$s_x^2 = \sigma_x^2 (1 - r^2) = \sigma_x^2 - q_2^2 \sigma_y^2$

$s_y^2 = \sigma_y^2 (1 - r^2) = \sigma_y^2 - q_1^2 \sigma_x^2$

$q_1 \sigma_x^2 = q_2 \sigma_y^2$
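The relations among the seven quantities can be checked numerically. The following is only a sketch under stated assumptions: the data are simulated (not taken from any series discussed here), the slope 0.5 and the two standard deviations are arbitrary, and the array variabilities are computed from the moments by way of (4a) rather than observed array by array.

```python
import random
from statistics import fmean

def second_moments(xs, ys):
    """Return ([x^2], [y^2], [xy]) with x and y measured from their own averages."""
    mx, my = fmean(xs), fmean(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    var_x = fmean([d * d for d in dx])
    var_y = fmean([d * d for d in dy])
    cov_xy = fmean([a * b for a, b in zip(dx, dy)])
    return var_x, var_y, cov_xy

# Hypothetical series: y depends linearly on x, with equal scatter in every array.
rng = random.Random(0)
xs = [rng.gauss(0.0, 2.0) for _ in range(5000)]
ys = [0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

var_x, var_y, cov_xy = second_moments(xs, ys)
q1 = cov_xy / var_x                    # from [xy] = q1 * sigma_x^2
q2 = cov_xy / var_y                    # from [xy] = q2 * sigma_y^2
r = cov_xy / (var_x * var_y) ** 0.5    # r = [xy] / (sigma_x * sigma_y)
s_y2 = var_y - q1 ** 2 * var_x         # sigma_y^2 = s_y^2 + q1^2 sigma_x^2, rearranged
s_x2 = var_x - q2 ** 2 * var_y         # the same relation with x and y exchanged

print("q1*q2 =", round(q1 * q2, 6), " r^2 =", round(r * r, 6))
print("s_y^2 =", round(s_y2, 6), " sigma_y^2 (1 - r^2) =", round(var_y * (1 - r * r), 6))
print("q1 sigma_x^2 =", round(q1 * var_x, 6), " q2 sigma_y^2 =", round(q2 * var_y, 6))
```

Only three of the seven values are free: once `var_x`, `q1`, and `s_y2` are fixed, every other quantity printed above is determined.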



Since these four equations determine relations between these seven
values, there are only three that are independent. The character of the
problem to be investigated determines which among the seven values
may be considered as independent factors. To give an example:
When we investigate proportions of the body we may recognize that
in a group of common descent individuals of a certain stature have a
certain average weight which has a certain variability. The relation is
a physiological and anatomical one, and the stature partly determines
the weight. The three values, the relation of which determines the
other values, are, therefore, $\sigma_x$, which expresses the range of stature of
the individuals examined; $s_y$, the variability of the independent array of
weights; and $q_1$, the coefficient which determines the average weight as
determined by stature. All the other values must be expressed in
terms of these three.

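$\sigma_y^2 = s_y^2 + q_1^2 \sigma_x^2$

$q_2 = \dfrac{q_1 \sigma_x^2}{s_y^2 + q_1^2 \sigma_x^2}$

(6) $r = \dfrac{q_1 \sigma_x}{\sqrt{s_y^2 + q_1^2 \sigma_x^2}} = \dfrac{1}{\sqrt{1 + \dfrac{s_y^2}{q_1^2 \sigma_x^2}}}$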



If, for any reason, there is a selection of values of x, the coefficient of
correlation will change its value, although the actual functional
relation between x and y remains the same. In other words, in cases of this
type, the coefficient of correlation is an artificial value, an algebraic
convenience which depends upon the composition of the series. It
follows from this that, for instance, in military statistics in which a
selected group is examined, the coefficient of correlation for the selected
group cannot coincide with the coefficient for the total series.
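The effect of selection can be illustrated with a small simulation. This is a sketch under assumptions not found in the text: the values q1 = 0.8, a standard deviation of 10 for x, and an array variability of 6 for y are hypothetical, and the selection simply discards every case with x below its average. The function `r_from_eq6` evaluates r = 1 / sqrt(1 + s_y^2 / (q1^2 sigma_x^2)), the form in which r is expressed through these three quantities.

```python
import math
import random
from statistics import fmean

def r_from_eq6(q1, sigma_x, s_y):
    """r expressed through q1, sigma_x and s_y: 1 / sqrt(1 + s_y^2 / (q1^2 sigma_x^2))."""
    return 1.0 / math.sqrt(1.0 + s_y ** 2 / (q1 ** 2 * sigma_x ** 2))

def slope_and_r(pairs):
    """Return (q1, r) estimated from a list of (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = fmean(xs), fmean(ys)
    var_x = fmean([(x - mx) ** 2 for x in xs])
    var_y = fmean([(y - my) ** 2 for y in ys])
    cov = fmean([(x - mx) * (y - my) for x, y in pairs])
    return cov / var_x, cov / math.sqrt(var_x * var_y)

# Hypothetical values, chosen only for illustration.
Q1, SIGMA_X, S_Y = 0.8, 10.0, 6.0

rng = random.Random(1)
full = []
for _ in range(20000):
    x = rng.gauss(0.0, SIGMA_X)
    full.append((x, Q1 * x + rng.gauss(0.0, S_Y)))

# Selection on x alone: the lower half of the distribution is eliminated.
selected = [(x, y) for x, y in full if x > 0.0]

q1_full, r_full = slope_and_r(full)
q1_sel, r_sel = slope_and_r(selected)

print("total series:    q1 =", round(q1_full, 3), " r =", round(r_full, 3))
print("selected series: q1 =", round(q1_sel, 3), " r =", round(r_sel, 3))
print("r from q1, sigma_x, s_y (total series):", r_from_eq6(Q1, SIGMA_X, S_Y))
```

In this sketch the estimate of q1 is nearly the same in both series, while r falls in the selected series because the range of x has been reduced.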

It must be remembered that in this case $q_2$ will no longer be constant,
since the arrays are disturbed by the selection. The general theory,
however, remains unchanged if we substitute in equation (2b)

and utilize $q_2'$ in place of $q_2$. It will be seen that in this case the second
assumption, namely, the equality of the variability of the arrays of y,
does not hold good. If there is a selection with regard to both variables,
as in military statistics, when individuals of underweight and of
understature are eliminated, neither $q_1$ nor $q_2$ can be determined according
to the method here discussed. We must substitute for equation (4a),

(7) $\sigma_y^2 = [s_y^2] + [(q_1 x)^2]$

then equation (6) takes the form

$r = \dfrac{1}{\sqrt{1 + \dfrac{[s_y^2]}{[(q_1 x)^2]}}}$

In these cases, the essential problem will be not the determination of 
r, but the determination of the two values of q. 

If we had a selection in which only one end of the distribution were
eliminated the problem could easily be solved. In most cases, however,
the cutting off of one end will affect to a greater or lesser extent the
adjoining part of the distribution curve. This may be seen clearly in the
distribution of statures in the Italian army as given by Livi in his
Antropometria Militare. The lowest stature is 154 cm., but those just
above 154 cm. are also reduced in numbers because individuals of low
chest measurements and other deficiencies will be more common among
those individuals whose stature is a little over 154 cm. than among tall
ones. It does not seem, therefore, worth while to carry through a detailed
calculation of the general problem because the characteristics of the
distribution will have to be investigated in each particular case.

The dependence of the coefficient of correlation upon the distribution
of the variables appears particularly in cases of tetrachoric distributions
of disparate forms. We assume two alternative pairs of forms: a with
the probability of occurrence $p_a$, and not a with the probability of
occurrence $q_a$; and b and not b with the respective probabilities of
occurrence $p_b$ and $q_b$. Among all the cases of occurrence of a, b combined
with a has the probability $f_{ab}$; among all the cases of non-occurrence of
a, b combined with not a the probability $f_{ob}$. In the same way, among all
the cases of occurrence of b, a combined with b has the probability
$f'_{ab}$; among all the cases of non-occurrence of b, a combined with not b
the probability $f'_{oa}$.

When in a particular series the form a appears $p_a + x$ times, the form
not a will appear $q_a - x$ times. In such a group of series b will occur on
the average

$[p_b + y] = f_{ab}(p_a + x) + f_{ob}(q_a - x)$ times; and

$[y] = x(f_{ab} - f_{ob}) = cx$.

In the same way $[x] = y(f'_{ab} - f'_{oa}) = c'y$.

Hence $[xy] = c[x^2] = c'[y^2]$

since

$[x^2] = p_a q_a$, $[y^2] = p_b q_b$,

$[xy] = c\,p_a q_a = c'\,p_b q_b$

$r = \dfrac{[xy]}{\sqrt{p_a q_a \cdot p_b q_b}}$



In this case c and c' are quantities determined by the character of the
series: $p_a$ and $p_b$ express the relative frequencies of the two types
treated. It appears, therefore, that r is determined by these
frequencies. To give an example: Given a large bag of black ($np_a$) and
white ($nq_a$) pieces of cardboard; these pieces are in part round ($np_b$),
in part square ($nq_b$); $np_a f_{ab}$ pieces are black and round; $nq_a f_{ob}$ are
white and round; $np_b f'_{ab}$ are round and black; and $nq_b f'_{oa}$ are square
and black. Then even if the values of f remain the same, the coefficient
of correlation is determined by the frequencies $p_a$ and $p_b$. It will be
understood that the values of f are all interdependent. We may write

$f_{ab} = p_b + \dfrac{\delta}{p_a}$, $f'_{ab} = p_a + \dfrac{\delta}{p_b}$

$f_{ob} = p_b - \dfrac{\delta}{q_a}$, $f'_{oa} = p_a - \dfrac{\delta}{q_b}$

Then, as is easily shown, for a series of n cases

$\delta = [xy]$
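The bag of cardboard pieces can be worked through numerically. The counts below (40 black and round, 10 black and square, 20 white and round, 30 white and square) are hypothetical, as are the function names; delta is taken here as the excess of the joint relative frequency of a and b over the product of p_a and p_b. The sketch computes c and c' as differences of the conditional frequencies, and then shows how r moves when the values of f are held fixed and only the proportion of black pieces is changed.

```python
import math

def fourfold(n_ab, n_a_notb, n_nota_b, n_nota_notb):
    """Return the quantities of the text for a fourfold table of counts."""
    n = n_ab + n_a_notb + n_nota_b + n_nota_notb
    p_a = (n_ab + n_a_notb) / n
    p_b = (n_ab + n_nota_b) / n
    q_a, q_b = 1.0 - p_a, 1.0 - p_b
    f_ab = n_ab / (n_ab + n_a_notb)              # b among the cases of a
    f_ob = n_nota_b / (n_nota_b + n_nota_notb)   # b among the cases of not a
    fp_ab = n_ab / (n_ab + n_nota_b)             # a among the cases of b
    fp_oa = n_a_notb / (n_a_notb + n_nota_notb)  # a among the cases of not b
    delta = n_ab / n - p_a * p_b
    return {
        "p_a": p_a, "p_b": p_b,
        "c": f_ab - f_ob,
        "c_prime": fp_ab - fp_oa,
        "delta": delta,
        "r": delta / math.sqrt(p_a * q_a * p_b * q_b),
    }

def r_for_fixed_f(f_ab, f_ob, p_a):
    """r when f_ab and f_ob are held fixed and only the frequency of a varies."""
    q_a = 1.0 - p_a
    p_b = f_ab * p_a + f_ob * q_a
    q_b = 1.0 - p_b
    c = f_ab - f_ob
    return c * math.sqrt(p_a * q_a / (p_b * q_b))

table = fourfold(40, 10, 20, 30)
print("c =", round(table["c"], 4), " c' =", round(table["c_prime"], 4))
print("delta =", round(table["delta"], 4), " r =", round(table["r"], 4))

# Same f_ab = 0.8 and f_ob = 0.4, different proportions of black pieces.
for p_a in (0.5, 0.3, 0.1):
    print("p_a =", p_a, " r =", round(r_for_fixed_f(0.8, 0.4, p_a), 4))
```

With these counts c and c' differ from each other, and r equals the square root of their product; with c held at 0.4, r falls as the black pieces become rarer.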

It is obvious that this proof holds good only in cases of alternative
qualities. When we have a tetrachoric distribution based on an artificial
division of a range of continuous variation, the values c and c'
are not constant.

I may point out that the proof given here proves the correctness of 
my previous discussion of this subject (Science, N. S. Volume 29, 1909, 
pp. 823-824). It does not seem to me that the criticism of the theory 
made by Karl Pearson (Science, N. S. Volume 30, 1909, p. 23) and by 
Karl Pearson and David Heron (Biometrika, Vol. 9, 1913, p. 166 and 
Vol. 8, 1911, p. 114) is valid. This criticism was repeated later on by 
Arnold Lang (Experimentelle Vererbungslehre in der Zoologie, seit 1900, 
Erste Hälfte, Jena, 1914, p. 419) on the basis of Poniatowski's remarks.
I have discussed this subject more fully in a general report on Anthro- 
pometric Investigations of the Population of the United States. 

A second case in which the purely conventional meaning of the
coefficient of correlation appears clearly is that of mixed series. I have
treated this problem at another place* and it may be sufficient to repeat
the general argument as applied to a particular case. If we have a
mixed population of round-headed and long-headed individuals who
may have intermarried, and among whom there is a tendency of
reversion to parental types; if, furthermore, the long heads in this series
are in absolute measures long and narrow, the round heads absolutely
short and wide, then short heads will, on the whole, be wide, and long
heads narrow, and we find a negative correlation; while in a homogeneous
series there will be a positive correlation, because both absolute
measures increase with absolute size of the head. In cases of this type,
the coefficient of correlation is a valuable means of determining the
composite character of the series. If the character of the homogeneous
series is known, it may even enable us to determine the character of the
composition, but it does not express the inner relation between two
values.

* "The Cephalic Index," American Anthropologist, N. S. 1:448-461.
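The case of the mixed series may be illustrated by simulation. Everything numerical here is an assumption made for the sketch: the average lengths and breadths of the two types (in millimeters), the common size factor, and the equal mixture of the two types are hypothetical and are not taken from any measured population.

```python
import math
import random
from statistics import fmean

def correlation(pairs):
    """Coefficient of correlation of a list of (length, breadth) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = fmean(xs), fmean(ys)
    var_x = fmean([(x - mx) ** 2 for x in xs])
    var_y = fmean([(y - my) ** 2 for y in ys])
    cov = fmean([(x - mx) * (y - my) for x, y in pairs])
    return cov / math.sqrt(var_x * var_y)

def sample_heads(rng, n, mean_length, mean_breadth):
    """Homogeneous series: both measures increase with a common size factor."""
    heads = []
    for _ in range(n):
        size = rng.gauss(0.0, 1.0)
        length = mean_length + 5.0 * size + rng.gauss(0.0, 3.0)
        breadth = mean_breadth + 4.0 * size + rng.gauss(0.0, 3.0)
        heads.append((length, breadth))
    return heads

rng = random.Random(2)
long_heads = sample_heads(rng, 5000, 195.0, 140.0)    # long and narrow (hypothetical)
round_heads = sample_heads(rng, 5000, 180.0, 155.0)   # short and wide (hypothetical)
mixed = long_heads + round_heads

print("long heads alone: ", round(correlation(long_heads), 3))
print("round heads alone:", round(correlation(round_heads), 3))
print("mixed series:     ", round(correlation(mixed), 3))
```

Each homogeneous series gives a positive coefficient, while the mixed series gives a negative one, although the relation between length and breadth within each type is unchanged.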

Correlations may thus merely be expressions of the variability of a
series. At another place* I have demonstrated the relation between
the coefficient of correlation expressing the similarity of brothers and
the variability of family groups. For large fraternities or whole
populations, this relation is easily demonstrated. If a number of
fraternities or local units are distributed, each around its own average, and
the averages themselves are distributed according to the laws of chance, we
may call the deviation of the average of a particular population or
fraternity from the general average x; that of an individual in each
population or fraternity from its own average y. Then, the correlation will
depend upon the products

$(x + y)(x + y_1)$

which have to be taken in each group. Since the individuals in each
group are independent

$[(x + y)(x + y_1)] = [x^2] = \sigma_x^2$

and

$r = \dfrac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2}$

In other words, the coefficient of correlation is determined by the 
variability of the averages of the separate groups and that of the 
individuals in each group. 
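The relation between the similarity of brothers and the two variabilities can be checked by simulation. The sketch assumes that the coefficient takes the form sigma_x^2 / (sigma_x^2 + sigma_y^2), as follows from the independence of individuals within a group; the values 3 and 4 for the variability of the family averages and of the individuals within a family are hypothetical, as is the restriction to two brothers per family.

```python
import math
import random
from statistics import fmean

def expected_r(sigma_x, sigma_y):
    """r = sigma_x^2 / (sigma_x^2 + sigma_y^2)."""
    return sigma_x ** 2 / (sigma_x ** 2 + sigma_y ** 2)

def brother_pairs(rng, n_families, sigma_x, sigma_y):
    """Each family has a deviation x; two brothers deviate from it independently."""
    pairs = []
    for _ in range(n_families):
        x = rng.gauss(0.0, sigma_x)        # deviation of the family average
        y = rng.gauss(0.0, sigma_y)        # first brother, measured from the family average
        y1 = rng.gauss(0.0, sigma_y)       # second brother, measured from the family average
        pairs.append((x + y, x + y1))
    return pairs

def correlation(pairs):
    """Coefficient of correlation of a list of (first, second) pairs."""
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    mx, my = fmean(xs), fmean(ys)
    var_x = fmean([(a - mx) ** 2 for a in xs])
    var_y = fmean([(b - my) ** 2 for b in ys])
    cov = fmean([(a - mx) * (b - my) for a, b in pairs])
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(3)
pairs = brother_pairs(rng, 20000, 3.0, 4.0)
r_sim = correlation(pairs)

print("expected r: ", expected_r(3.0, 4.0))
print("simulated r:", round(r_sim, 3))
```

No physiological relation between the brothers enters the simulation; the coefficient arises wholly from the ratio of the two variabilities.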

* "On the Variety of Lines of Descent Represented in a Population," American Anthropologist, N. S.
18: 1-9.



